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Abstract 

Background: Due to tine importance of Penicillium chrysogenum liolding in medicine, the genome of low-penicillin 
producing laboratorial strain Wisconsin54-l 255 had been sequenced and fully annotated. Through classical 
mutagenesis of Wisconsin54-1255, product titers and productivities of penicillin have dramatically increased, but 
what underlying genome structural variations is still little known. Therefore, genome sequencing of a high- 
penicillin producing industrial strain is very meaningful. 

Results: To reveal more insights into the genome structural variations of high-penicillin producing strain, we 
sequenced an industrial strain P. chrysogenum NCPC10086. By whole genome comparative analysis, we observed a 
large number of mutations, insertions and deletions, and structural variations. There are 69 new genes that not 
exist in the genome sequence of Wisconsin54-1255 and some of them are involved in energy metabolism, 
nitrogen metabolism and glutathione metabolism. Most importantly, we discovered a 53.7 Kb "new shift fragment" 
in a seven copies of determinative penicillin biosynthesis cluster in NCPC10086 and the arrangement type of 
amplified region is unique. Moreover, we presented two large-scale translocations in NCPC10086, containing genes 
involved energy, nitrogen metabolism and peroxysome pathway. At last, we found some non-synonymous 
mutations in the genes participating in homogentisate pathway or working as regulators of penicillin biosynthesis. 

Conclusions: We provided the first high-quality genome sequence of industrial high-penicillin strain of P. 
chrysogenum and carried out a comparative genome analysis with a low-producing experimental strain. The 
genomic variations we discovered are related with energy metabolism, nitrogen metabolism and so on. These 
findings demonstrate the potential information for insights into the high-penicillin yielding mechanism and 
metabolic engineering in the future. 




Genomics 



Background 

Penicillin and P -lactam antibiotic play a significant role in 
human medical history [1,2] since Fleming's discovery of 
the filamentous fungus Penicillium notatum in 1929 [3]. 
The regulation of penicillin biosynthesis has been studied 
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for many years, together with much more proteins or 
pathways were discovered [4-9]. The improvement of 
P. chrysogenum strains to obtain higher penicillin yields 
is a main intense objective in industrial research [10,11]. 

Due to the importance of P. chrysogenum, the genome 
sequence of low-penicillin producer Wisconsin54-1255, 
which is widely used in laboratories, was sequenced and a 
number of genes responsible for key steps in penicillin 
production were identified [12]. The precursors for peni- 
cillin biosynthesis, genes encoding microbody proteins and 
transporters were found, illustrating potential for future 
genomics-driven metabolic engineering [12]. Through 
classical mutagenesis and screening methods, product 
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titers and productivities of penicillin have dramatically 
increased since Wisconsin54-1255 strain, but how low- 
penicillin producer strain was transformed into an efficient 
producer through improvement is still challenging. For 
commercial reasons, the improvement of P. chrysogenum 
strains has never been stopped. The productivity of indus- 
trial used strains is far more higher than their ancestor, 
and the progress was mainly obtained by classical muta- 
genesis and screening methods. Because mutations were 
random, most of the genetic changes in high yield strains 
were unclear. Although some significant structural varia- 
tions (SVs) [8,9,13] and differential expression profiling 
[12,14,15] have been found in high-penicillin producing 
strains, little is known about the underlying whole geno- 
mic changes between low-producing laboratorial strain 
and high-producing industrial strain. 

To gain more insight into the genome structural varia- 
tions of high-penicillin producing strain, we sequenced a 
Chinese industrial strain NCPC10086. We also offer a 
comprehensive comparative genomics analysis [16-19] 
to find all mutations and large-scale structural variations 
between NCPC10086 and the first published genome of 
P. chrysogenum strain, Wisconsin54-1255 [12]. Some 
variations including mutations, indels and structural var- 
iations were considered for their potential biological 
impact for penicillin biosynthesis. Our genome sequence 
data and analyses explore the differences between high- 
and low-yield P. chrysogenum strains and demonstrate 
the potential useful information to improve strains by 
direct genetic engineering tools. 

Results 

Genome sequencing, assembly and general characerlstlcs 

We sequenced the genome of P. chrysogenum NCPC10086 
using a whole-genome shotgun sequencing strategy 
[20,21]. Owing to different sequencing technologies have 
various advantages and disadvantages [22,23], we gener- 
ated a high quality genome assembly using a combination 
of first and second generation sequencing platforms and 
strategies (Table 1). First, we generated single-end reads 
using Roche 454 pyrosequencing platform [24] and mate- 
pair reads with 3-4 Kb and 6-8 Kb insert fragment sizes, 
using ABI 3730 and MegaBACE 1000 Sanger sequencing 



platforms [25], respectively. Then we generated mate-pair 
reads with 1-2 Kb insert fragment size using lUumina 
HiSeq 2000 sequencing platform [26,27] and used all 
mate-pair reads to join contigs into scaffolds. Overall, 
we get 204x sequencing coverage of high quality reads for 
de novo assembly (Table 1). 

We got a total genome size of 32.3 Mb (Table 2) similar 
as Wisconsin54-1255 [12]. The length of longest contig is 
1,655 Kb, which indicates fine continuity of assembly. 
Owing to the deeper sequencing data, the contig N50 of 
NCPC10086 is 661 Kb which is 70% higher than Wiscon- 
sin54-1255 (389 Kb). The scaffold N50 of our genome is 
2.8 Mb and the longest scaffold is 4,063 Kb, illustrating our 
genome is suitable for structural variations detection, espe- 
cially large-scale translocations. The gene structures were 
predicted with a combined de novo and homology-based 
approach. Firstly, we masked all the repeat sequences in 
the genome and used Fgenesh (v 3.1.2) [28] and GeneMark- 
ES (v 2.3e)[29] to provide an initial set of 11,284 predicted 
ORFs. Secondly, we took advantage of the gene prediction 
results of Wisconsin54-1255 [12] to revise and comple- 
ment our predicted genes by homology searches. At last, 
the former two results were integrated together, and 
13,290 protein-coding genes were predicted in P. chryso- 
genum NCPC10086 genome (Table 2). GC content of the 
genome is 48.9% and every 2,430 bp has one gene. The 
mean gene length is 1,499 bp and most of the genes have 
introns. 

As to gene annotation, four databases of Non-redun- 
dant (NR), InterProscan [30,31], Swiss-Prot/UniProtKB 
[32,33] and Gene Ontology system (GO) [34,35] were 
used to annotate 13,290 predicted genes of NCPC10086 
(Figure lA). We found 12,906 genes have homolog in 
NR, 9,371 genes have protein structural domains in 
InterProscan, and 7,625 genes have homolog in Swiss- 
Prot/UniProtKB, and 6,831 genes can be classified in 
GO. There are 6,140 genes can be annotated by all four 
gene annotation systems (Figure lA), suggesting these 
genes are well studied. The second large genes group is 
NR-specific with the number of 3,510, which indicates 
functions of these genes are just beginning to be under- 
stood. The following two large groups are 1,362 genes 
with homolog in Swiss-Prot database and 1,081 genes 



Table 1 P. chrysogenum NCPC10086 genome sequencing data 


Instruments 


Insert fragment size (Kb) 


Reads length 

(bp) 


Sequencing throughput (Mb) 


Coverage 


Roche 454 GS 


single end 


410 


614 


18x 


lllumina HiSeq 2000 


1-2 


50 


6,120 


180X 


ABI 3730 


3-4 


659 


170 


5x 


MegaBACE 1000 


6-8 


739 


34 


1x 


Total 






6,938 


204X 



The estimated genome size of P. chrysogenum NCPC1 0086 is about 34 Mb. 
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Table 2 Global statistics of the genome assembly and annotation of P. chrysogenum NCPC10086 



Assembly 


Number 


N50 (Kb) 


Longest (Kb) 


Size (Mb) 


Percentage of the assembly 


Contigs 


327 


661 


1,655 


322 


99.7 


Scaffolds 


175 


2,847 


4,063 


32.3 


100 


Annotation 


Number 


Mean length 


GC content (%) 


Gene density 












(1 gene every n bp) 




Coding genes 


1 3,290 


1,499 


48.9 


2,430 




Genes with intron 


1 0,966 


1,559 


51.6 


2,945 





A 



InterProscan 



Swiss-Prot 





Cellular Component Molecular Function Biological Process 



Figure 1 Gene annotation and gene ontology of P. chrysogenum NCPC10086. (A) Venn diagram showing unique and shared proteins could 
be annotated by databases of Non-redundant, InterProscan, Swiss-Prot/UniProtKB and Gene Ontology. (B) There are 6,831 proteins could be 
assigned to cellular component, biological process and molecular function by Gene Ontology classification system. 
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with protein domain in InterProscan. Both of them can- 
not be assigned to GO system, indicating functions of 
these genes are little known but ready for deeper inves- 
tigation in some extent. There are 6,831 genes can be 
assigned at least one GO term for describing cellular 
component, biological process and molecular function 
classification (Figure IB). 

Genome comparison analysis between P. chrysogenum 
NCPC10086 and Wlsconsin54-1255 

P. chrysogenum Wisconsin54-1255 is a low-penicillin 
producer strain widely used in libraries. The genome 
sequence of Wisconsin54 was well sequenced in 2008, 
which is the first published genome of P. chrysogenum 
[12]. The evolutionary relationship between Wiscon- 
sin54-1255 and our sequenced high-penicillin strain 
NCPC10086 is very close. Here we offer a comprehen- 
sive genome comparison analysis between these two 
P. chrysogenum strains, and try to figure out some inter- 
esting genomics discrepancy. To access comparative 
genomics statistics results, we aligned all the genes of 
Wisconsin54-1255 to the scaffolds of NCPC10086 to 
detect gene mutations. There are 12,943 predicted genes 
in Wisconsin54-1255 and 13,290 predicted genes in 
NCPC10086. And we discovered 11,573 genes are iden- 
tical between two strains, 89% for Wisconsin54 and 87% 
for NCPC10086, which indicates these two genomes are 
very conservative (Figure 2A). As to the non-identical 
genes, 1,154 genes' identity are higher than 90%, part of 
them leading to mutations, and only 22 genes have less 
than 60% identity. In addition, we complemented 64 
genes, which are partial in Wisconsin54-1255 and 514 
noncoding sequence regions are redefined as protein- 
coding genes. Using Wisconsin54-1255 genome 
sequence as a reference, we realigned 50x high-quality 
short reads of NCPC10086 from lUumina HiSeq 2000 to 
identify single nuclear variations (SNVs). In order to dif- 
ferentiate sequencing errors from mutations, we used 
three thresholds to filter out unreliable mutation results: 
(1) we required at least five reads for each mutation; (2) 
average quality of each mutation had to be higher than 20; 
(3) there had to be at least one pair of mate-pair reads to 
support. We got 759 SNVs in coding regions and 1,813 
SNVs in noncoding regions. There are 135 SNVs in intron 
regions, 177 are synonymous mutations and 447 are non- 
synonymous mutations, including 34 termination codon 
mutations. All SNVs result in coding region is described 
as Figure 2B. Furthermore, we found 35 deletions and 19 
insertions in exon regions, 86 copy number variations 
(CNVs) with the total size of 176,684 bp. The polymorphic 
genes with non-synonymous mutations are listed in Addi- 
tional file 1. 

Besides, we found 69 new genes that do not exist in the 
available genome sequence of Wisconsin54-1255 [12]. We 




Figure 2 The single nuclear variations (SNVs) statistics between 
NCPC10086 and Wisconsin54-1255. (A) We discovered 11,573 
genes are identical and 759 SNVs between two strains. (B) Among 
tliem, 135 SNVs tal<e place in intron regions, 177 SNVs are 
synonymous mutations and 447 SNV are non-synonymous 
mutations including 34 termination codon mutations. 



analyzed these new genes carefully and figured out several 
metabolism, biosynthesis and degradation (Table 3). 
Firstly, Pchl25gl0680, Pchl06g00010, Pchll4g00050 
genes involved in amino sugar, nucleotide sugar metabo- 
lism, N-Glycan biosynthesis and oxidative phosphorylation 
may provide more energy in high-penicillin producer for 
penicillin synthesis. Secondly, we found another new gene, 
Pchl06g00010, involved in nitrogen metabolism, which is 
up regulated strongly in cultures supplemented with the 
side chain precursor PAA (phenylacetic acid) in high- 
producing strain [36] . The last but not the least, the new 
gene, Pch018g00010, was discovered to participate in 
glutathione metabolism that may boost production of 
penicillin. The biosynthesis of cysteine is precursor for the 
penicillin biosynthesis, and the genes involved in this path- 
way were over expressed in the high-penicillin producing 
strain [12]. As to the previous study [37], the increase in 
the cysteine biosynthesis requires a large NADPH supply, 
but the oxidized glutathione under oxidative stress also 
requires NADPH, which could reduce the cysteine bio- 
synthesis. So, we presume that glutathione metabolism 
may save NADPH and indirectly promote the penicillin 
production. 

To our best knowledge, the penicillin biosynthetic 
genes cluster (hereafter named PBC) located at chromo- 
some I in P. chrysogenum is the dominant core for 
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Table 3 Metabolism or progress Involved by several 


"new" genes 


G6ne name 


LenQth 


Location 


The metabolism or progress 




(bp) 


(bp) 




Pchl25g 10680 


4980 


scaf 1 25 


MIIIIMU ^ugdl clliu llUCICUllUC SUydl IIICLaUUIISIII 










Pchl06g 00010 


769 


scaf 1 06 


Nitrogen metabolism, oxidative phosphorylation 






1, 1 yj 1 ) 




Prhi i4nnnn'^n 

rLIil IHyuUUJU 




scaf 1 1 4 


OvlHati\/£a rih^criK^A/l^ti/in 
WAlUdLlvc pi iUS[JMUiyiallUM 






(16,204-16,356) 




Pch041g00010 


713 


scaf041 


Riboflavin metabolism 






(112-824) 




Pch056g00010 


787 


scaf056 


N-Glycan biosynthesis 






(36-822) 




PchOlSgOOOlO 


694 


scaf018 


Glutathione, arachidonic acid, taurine and hypotaurine metabolism, 






(325-1,018) 




PchlSOgOOOlO 


580 


scaf 180 


Fluorobenzoate, chlorocyclohexane and chlorobenzene, toluene degradation 






(50-629) 





penicillin production [9,38], which exist in all strains, 
including NRRL 1951 (wild-type) [39], Wisconsin54-1255, 
NCPC10086, P2 (Panlabs), AS-P-78 and El (Antibioticos, 
S.A.) [13]. Three penicillin biosynthetic genes, pcbAB, 
pcbC and penDE are gathered in this cluster (Figure 3A). 
Copy number and fragment arrangement are key features 
about PBC, which could impact on the yield of penicillin. 
PBC was amplified five to sixteen copies in different high- 
penicillin producers, such as five or six copies for AS-P-78 
and twelve to fourteen copies for El [13] (Figure 3B). 
As to the low-productive strain Wisconsin54-1255, PBC 
just has one single copy with the length of 56.9 Kb, 
consisting of a 53.7 Kb fragment and a 3.2 Kb shift frag- 
ment bounded by a conserved TGTAAA/T hexanucleo- 
tide [8] (Figure 3A). Through reads coverage detection 
method, we found seven copies of PBC in NCPC10086 
with the length of 56.9 Kb, including two copies for 
110.6 Kb fragment of D-EG'-F'E'-D' and 56.9 Kb frag- 
ment of E'-DT-G, and one copy for 56.9 Kb fragment of 
E'-D'G'-F' (Figure 3B). Compared with the previous 
investigations [8,9,13,40], one PBC fragment arrangement 
in NCPC10086 is unique and has never been reported 
before, orange bar and blue letters shown in Figure 3B. It 
is a 53.7 Kb "NEW SHIFT FRAGMENT" in our genome. 
We believe the TTTACA hexanucleotide and its inverse 
complement TGTAAA could be hot spots for site-specific 
recombination after mutation with nitrosoguanidine. 
Unfortunately, the length of PBC is so long that we cannot 
get the full precise arrangement of these copies. 

The genes involved in fungal secondary metabolic 
pathways share a common tendency towards physical 
cluster, with a preference for subtelomeric regions [41]. 
The wide range of translocation is a very interesting 
phenomenon happened in industrial filamentous fungus, 
for example, Aspergillus niger [16]. Comparison with 
Wisconsin54-1255, we found two large fragment trans- 
locations in NCPC10086. One is a 266 Kb fragment in 



subtelomere (position 5,739,429-6,005,263 in scaffold22 
of Wisconsin54-1255) transferred to the around centro- 
mere in NCPC10086 and the range of translocation is 
about 3 Mb (Figure 4A). To avoid the possibility of assem- 
bly error, we realigned mate-pair reads around the break- 
point of translocation and found enough reads to across 
the breakpoint (Figure 4B). Moreover, we did PCR identifi- 
cation around the breakpoint and only NCPC (N) has 
band with estimated size (Figure 4C). This 266 Kb frag- 
ment of translocation includes 107 genes from Pc22g24360 
to Pc22g25450 and their mean genes size is 1,638 bp. 
Unfortunately, the functions of most genes are unknown 
except for Pc22g24480 {me), which encodes a regulator of 
nitrogen metabolite repression (red boxed in Figure 4A). 
We hypothesize that the translocation of Pc22g24480 may 
promote the nitrogen metabolism in NCPC10086, corre- 
sponding to the "new" gene discovered in nitrogen metabo- 
lism, which is described earlier in this article. Another 
translocation is a 1,202 Kb fragment (position 387,513- 
1,589,522 bp in scaffoldl8 of Wisconsin54-1255) trans- 
ferred to scaffold072 of NCPC10086 starting at position 
148,434 bp (Figure 5A). The 1,202 Kb fragment consists of 
494 genes from Pcl8g01610 to Pcl8g06720 and their mean 
size is 1,480 bp. Those genes associate with energy metabo- 
lism and peroxisome pathway, such as Pcl8g02420, which 
encodes mitochondrial ADP/ATP carrier, and Pcl8g02550, 
which encodes PEX-2 (red boxed in Figure 5A). We also did 
mate-pair reads alignment around the breakpoint of translo- 
cation (Figure 5B) and PCR identification (Figure 5C) to 
certificate this translocation. 

At last, we focused on some genes involved in homo- 
gentisate and the regulators of penicillin biosynthesis. 
Table 4 shows the comparison results of these genes 
between Wisconsin54-1255 and NCPC10086. PahA gene 
encodes a phenylacetate 2-hydroxylase that catalyzes the 
first step of phenylacetate catabolism, decreasing the pre- 
cursor availability for penicillin biosynthesis [42]. Owing 
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( TGTAAA ) ( TGTAAT ) 
CD EG' F'H 
1 [5.065 K [5,060 K , |5,p55 K 15,050 K |5.p45 K 15,040 K |S,p3S K |5.p30 K |5,p25 K 15.020 K | | 



scaf2 



B 



NRRL1951 



Wisconsin54 



NCPC 10086 



AB 



AB 



AB 



CD 



CD 



pcbAB 

■53.7k 



pcbC penDEM 



I 



EF OH 



EG'FH 

^ NEW SHIFT FRAGMENT 
EG' F'E' / D' 




D'F G 



D'G' F'H 



---\-3.2k-\ 
shift fragment 



JUNCTIONS 
>AB 

GCTCC TTTACAC CATC 
>EG' 

TCAAG TTTACAT GGTC 
>EB 

TCAAG TTTACA CCATC 

>EF 

TCAAG TTTACA ACTAG 
>D0' 

CGTTTTTTACATGGTC 



AS-P-78 



P2 



El 



AB 



AB 



AB 




EB 



CD 



CD 



-unknown copies- 



EF GD 



1 or 2 copies - 
D 



EB 



EG' F'D 



I- 



-10- 12 copies ■ 



EF GH 



EG' F'H 



EG' F'H 



>D'F 

CGTTTTTTACAACTAG 



>CD 

GAGTTTGTAAAAAACG 



>F'E' 

CTAGTTGTAAACTTGA 



>F'D 

CTAGTTGTAAAAAACG 



>GD 

GACCATGTAAAAAACG 



>F'H 

CTAGTTGTAATTAAAT 



\ 

Figure 3 Comparative organizations of penicillin biosynthetic genes cluster (PBC) in different strains (A) PBC region of Wisconsin54-1255 
is about 56.9-kb, consisting of 53.7-l<b fragment and 32-kb sliift fragment bounded by a conserved TGTAAA/T hexanucleotide. (B) PBC fragment 
arrangement scliematic. We discovered a new shift fragment in NCPC10085, marl<ed witli orange bar and blue letters. 



to a point mutation (C — >T) in pahA gene, the homo- 
gentisate pathway for PAA cataboUsm has been reported 
to be largely inactivated but the penicillin yield is 
increased in Wisconsin54-1255 [42]. Based on the com- 
parative genomics analysis, we found pahA gene was 
shown in a 27.1 Kb translocation, but pahA gene is 



identical between the two strains. Three other genes, 
pahB (Pc22g0223G), pahC (Pcl6g01770) and pahD 
(Pc21g2256G), are strongly similar to pahA and all of 
them belonging to cytochrome P450 monooxygenases. 
For pahC gene, there is a mutation (C* C) causes a 
single amino-acid substitution at position 129 of the 
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B 

CTaTTCTIATTGGATSUrGCCQIIATCCTTGCTIUWCTCCS^TGGGGCCCXSUICJUITTACRCCIU^ 



CraiJTCTXAJTGGCAXjUZGCCGAXATCCTTGCTAAGCTCCGATGG GDCCaUUTAKTIACACCAJUXIY^AnTCGGCArXjU^GCArCU^TrrTGAI 
ClCAITCTIATTGGGA13U33CC»IATCCTrGCXAAGCTCCGAJGGGGa: CAACAATIACACaUUXTGAIArCGGaaiM^GCAJXM^TTTTGATakTG 
GOVXACGCCSOATCCITGCIAAGCICCaUGGGGarcCAACJIATlACAC TGAIATCGOCAIIAGOCAmGTTCTGATCATGT 

cisvi AncGC(X3viATcciTGciMGCicc»iGeGGka3nciaii&c»xa cGGoanGGcnmeTmsncaici 

ICTSIITCnATlOGianCGCaaXATCCTT ATCGUiLLUUICJUITnCACCIUCCTGAXATCGOCUnGGCATTaGTT TGATCATGT 

ICTSITTCTlATTOG»T3U:GCCSaATCCTTG GGG&CCJIACAATlACACCAJICCIGAlATCGGCAITAOGCATnGTTTlG 
CIQOTCTlArTGGQlTACGCCSaATCCTTGCTA OCCCAACAATTACACCAACCIGAIATCGGCAGTGGOCATTSGTTCTGATC 
CTSUTCTOTTGGmSCGCCraXATCCTTGCIAAGCTCCSUGGGG OIACJUTTACACCAACCTGAIATCGGCUTAGOCATTSGTTTIGATCSTG 
ATCCTTGCTAAGCTCCaUGGGGCCCX^UUIJUITTACaOCAACCTGAIATC GCATnGGCAmGlIllGATCATGT 
TCCrrGC130UXTCCC3lTGGGG(XCX3UICJU«nCCACCAflCCTGAiaTCG CATTAGTTTTGATCAJGT 
TTGClAAGCTCCGSUGGGGCSCCX^IACJUVTTACaGCAACCTGAIATCGGCA TTTGATCATGT 
GCraftGCnXOVIGGGIKCXSUVUAinaUDCMtCCTGjaUCGS GJUrCUGT 
I JUICAATTAOlCCAJICCTGAIArCGOCAITAGGCGrDlGTTTIGAlXIAIGr 




6()0bp 



Figure 4 Identification of 266 Kb translocation. (A) A 266 Kb fragment translocation (orange bar) between Wisconsin54-1255 and NCPCK 
Genes are marked with green bar; special one is red boxed. (B) Reads alignment of the region around the breakpoint of translocation shows that 
there are 11 reads to support our conclusion. (C) PGR identification of the translocation. W stands for Wisconsin54-1255 and N stands for 
NGPG 10086. 



protein: an alanine residue in strain Wisconsin54-1255 
has been substituted to proline in NCPC10086 strain 
(A129P). For pahD gene, we found a synonymous muta- 
tion (C^^^^— > T). All the promoter sequences of these 
genes are all the same. PclaeA gene was found as a global 
regulator of secondary metabolism, including penicillin 
biosynthesis, sporulation and pigmentation in P. chryso- 
genum [43], which is identical in sequence in the genome 
of Wisconsin54-1255 and NCPC10086 with the same 
copy number. Due to PclaeA gene encodes a velvet-like 
complex, we checked another velvet-like complex PcVeLA. 
Its corresponding PcvelA gene is a key regulator of metabo- 
lism, acting both as an activator and repressor of secondary 
metabolism in P. chrysogenum [44]. There is a mutation 
((-.1002^ T) in PcveM gene of NCPC10086 causes gluta- 
mine^^^ to stop codon (Q315Stop). Through functional 
domain scanning, we can see that position 28-238 is velvet 
factor non-mutated. 

Discussion 

We sequenced the whole genome of an industal high- 
penicillin producing strain NCPC10086 and provided an 
integral whole geome comparison results with Wiscon- 
sin54-1255. A total genome size of 32.3 Mb was assembled 
with contig N50 of 661 Kb and scaffold N50 of 2.8 Mb. 
The gene structures were predicted with a combined 



de novo and homology-based approach, and annotated by 
four gene annotation systems. 

By whole genome comparative analysis, we observed a 
large number of mutations, insertions and deletions, and 
structural variations. There are 69 "new" genes that not 
exist in the genome sequence of Wisconsin54-1255 and 
some of them are involved in energy metabolism, nitrogen 
metabolism and glutathione metabolism. As was expected, 
the high-penicillin producing strain needs more energy for 
penicillin synthesis, sorting, transport and processing, and 
we confirm some new genes participate in it. One "new" 
gene was discovered in nitrogen metabolism, which is up 
regulated strongly in cultures supplemented with the side 
chain precursor PAA (phenylacetic acid) in high-producing 
strain [36]. Both cysteine biosynthesis and the oxidized glu- 
tathione need NADPH, if glutathione metabolism is more 
active, NADPH could be reserved for more cysteine bio- 
synthesis to improve the penicillin production. Our "new" 
gene involved in glutathione metabolism may impact on 
this process. 

The penicillin biosynthetic genes cluster (PBC) is the 
well-known dominant core for penicillin production exist- 
ing in all strains; copy number and fragment arrangement 
are the key features for PBC. The high-penicillin producing 
strain, NCPC10086, has seven copies of PBC and one 
53.7 Kb "new shift fragment" with unique arrangement 
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Figure 5 Identification of 1,202 Kb translocation. (A) A 1,202 Kb fragment translocation (orange bar) between Wisconsin54-l 255 and 
NCPC10085. Genes are marked with green bar; special one is red boxed. (B) Reads alignment of the region around the breakpoint of 
translocation shows that there are 11 reads to support our conclusion. (C) PGR identification of the translocation. W stands for Wisconsin54-1255 
and N stands for NGPG10086. 



type. The TTTACA sequence and its inverse complement 
TGTAAA sequence could be hotspots for site-specific 
recombination after multiple mutations. This process may 
aim to repair damage from mutations by nitrosoguanidine. 
We found two large translocations in NCPC10086; one is a 
266 Kb fragment in subtelomere transferred to centromere 
including genes regulating nitrogen metabolite repression; 
another is a 1,202 Kb fragment consists of a mitochondrial 
ADP/ATP carrier involved in energy metabolism and 
peroxin-2 gene involved in peroxysome pathway. 



Due to our comparative genomics statistics results, we 
predicted that energy metabolism and nitrogen metabo- 
lism plays an important role in penicillin production 
together with glutathione metabolism and peroxysome 
pathway. To further analysis genes involved in those pro- 
cesses, we looked into two types of genes deeper, pahA 
gene set and velvet-like complex genes. Translocation, 
stop codon mutation, synonymous and non-synonymous 
mutations are found there. These variations may impact 
the homogentisate pathway for PAA catabolism as well as 



Table 4 Single nuclear variations (SNVs) involved in homogentisate pathway and the regulators of penicillin 
biosynthesis 



Gene Length Description 
(bp) 



pahD 2,423 Strongly similar to pahA 

PdaeA 1,340 A global regulator of secondary metabolism 
PcvelA 1,745 An activator and repressor of secondary metabolism 



Discrepancies 
(SNVs) 



pahA 1,727 A phenylacetate 2-hydroxylase which catalyzes the first step of the homogentisate pathway for PAA 
catabolism 

pahB 1,797 Strongly similar to paM 
pahC 1,785 Strongly similar to paM 



non-synonymous 
(G482G) 

synonymous 
(G1976T) 



non-synonymous 
(G 100271 
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global regulation of secondary metabolism, including peni- 
cillin biosynthesis, sporulation and pigmentation. 

We found out many mutations and structural varia- 
tions, but how many of them and how they affect the 
penicillin yield is still a formidable challenge. Efficient 
approaches to narrow down the possibilities are to 
sequence more genomes for common variations and 
system biological investigation using "omic" data [45]. 
Through genome resequencing and functional analysis, 
identification of precise mutations in strains with altered 
phenotypes will add insight into specific gene functions 
and guide further metabolic engineering efforts. 

Conclusions 

This is the first high-quality genome of high-penicillin 
producing industrial stain of Penicillium chrysogenum, 
which can provides abundant genetic information for 
broad biomedical researchers. Through comparative geno- 
mics analysis with low-producing strain, we found a lot of 
mutations, insertions and deletions, and structural varia- 
tions. Moreover, we showed some "new" genes not exis- 
tent in the public genome sequence of Wisconsin54-1255 
involved in energy metabolism, nitrogen metabolism and 
glutathione metabolism. Most remarkably, for the penicil- 
lin biosynthesis cluster, we are surprised to find a 53.7 Kb 
new "shift fragment" in our high-producing strain and the 
type of fragment arrangement is unique. In addition, we 
addressed a 266 Kb translocation including a regulator of 
nitrogen metabolite repression and a 1,202 Kb transloca- 
tion including genes involved in energy metabolism and 
peroxysome pathway. Our findings lay a foundation for 
the insights into the high-penicillin producing mechanism 
and metabolic engineering in the future. 

Methods 

Source of sample and culture conditions 

P. chrysogenum NCPC10086 strain was selected for gen- 
ome sequencing as it was commercialized in North China 
Pharmaceutical Group Corporation. Spore suspensions of 
NCPC10086 were inoculated in 40 mL of seed medium 
(20 g/L sucrose corn steep liquor, 20 g/L sucrose, 5 g/L 
yeast extract, 5 g/L CaCOs, pH 5.8) in 250 ml flasks and 
incubated on a rotary shaker (250 r.p.m.) at 26°C for 24 h. 
Two milliliters of seed culture were transferred to 40 mL 
of fermentation medium (35 g/L lactose, 30 g/L corn steep 
liquor, 5 g/L (NH4)2S04, 1 g/L KH2P04,1 g/L, K2SO4, 
10 g/L CaCOa, 2 g/L phenylacetic acid, 6 ml/L corn oil, 
pH 6.0) and grovm at 26°C with shaking at 250 r.p.m. 

Genome sequencing and assembly 

The genome of P. chrysogenum NCPC10086 strain was 
sequenced by whole genome random sequencing method 
[20,21]. We used Roche 454 GS TLX system to produce 
ISx coverage single-end reads with an average read length 



of 410 bp to do contig assembly. Moreover, 3-4 Kb and 
6-8 Kb mate-pair libraries were produced to do contig and 
scaffold assembly, with 5x coverage sequenced by ABI 
3730 system and Ix coverage sequenced by Mega- 
bacelOOO system. ABI 3730 and MegabacelOOO produced 
an average read length of 659 bp and 739 bp respectively. 
Phred and Phrap [46,47] were used to deal with the raw 
data from ABI 3730 and MegabacelOOO. Hybrid assembly 
was performed using Newbler [24] with all single-end and 
mate-pair reads by overlap-layout-consensus strategy. 
After assembly, we aligned our contigs to the reference 
sequence [12] to predict the gaps sizes distribution in our 
genome. According to the gaps size distribution, we 
designed 1-2 Kb mate-pair inserted fragment library to do 
scaffolding. We mapped the mate-pair high quality reads 
onto the scaffolds and used the reads to fill the gaps if one 
of the mate-pair reads located at the edge of gaps. At last, 
redundant sequences were deleted through self-alignment. 

Gene prediction and annotation 

The repeat sequences of P. chrysogenum NCPC10086 were 
masked throughout the genome using RepeatMasker 
(v 3.3.0) and the RepBase library (20120418) [48] with 
Tandem Repeats Finder (v 3.2.1) [49]. The gene structures 
were predicted with a combined de novo and homology- 
based approach. Firstly, for all repeat-masked scaffolds lar- 
ger than 1 Kb, Fgenesh (v 3.1.2)[28] and GeneMark-ES 
(v2.3e)[29] were performed on the whole genomic 
sequence to provide an initial set of predicted ORFs. Fge- 
nesh is trained on sequences of Penicillium funiculosum 
and GeneMark-ES is upon the self-training algorithm for 
fungal genomes. Preference was given to Fgenesh genes, 
and all predicted protein should be larger than lOaa. 
Secondly, we use the gene prediction results of P. chryso- 
genum Wisconsin54-1255 to revise and complement our 
predicted genes by homology searching. At last, the former 
two results were integrated together as predicted genes. 
All predicted proteins were blastp [50] against databases 
of GenBank's non- redundant proteins, InterProscan [30], 
Swiss-Prot/UniProtKB [32] and Gene Ontology (GO), and 
the best alignment of every protein was considered its 
annotation. No alignment results by blastp of predicted 
proteins were automatically considered as hypothetical 
proteins. We presented unique and shared proteins 
among four gene annotation systems by venn diagram 
(http://bioinfogp.cnb.csic.es/tools/venny/index.html). 
WEGO [51] was used to plot GO annotation results. The 
pathway analysis is carried out by KAAS (v 1.67x) with 
SBH method [52]. 

Calling single nuclear variations (SNVs), insertions and 
deletions (InDels) and copy number variation (CNVs) 

Based on the assembled P. chrysogenum Wisconsin54-1255, 
we realigned all of the high-quality reads with the genome 
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by SOAP(v 2.21) [53] to identify the SNVs. In the reads 
gap-free alignment process, at most two mismatches were 
allowed between a read and the reference, and best hits 
were selected. Multiple reads mapping results were filtered. 
We use SOAPsnp (vl.03)[54], a statistical model based on 
Bayesian theory and Illumina quality system, to calculate a 
probability for each possible genotype at each position on 
the reference genome. We used five thresholds to filter out 
unreliable SNV results: (1) we required at least five reads 
for each SNV; (2) average quality of each SNV had to be 
higher than 20; (3) the overall depth had to be less than 
150; (4) the approximate copy number of flanking 
sequences had to be less than 2 (to avoid misreading SNV 
caused by the alignment of similar reads from repeat units 
or by copy number variations); (5) there had to be at least 
one pair of mate-pair reads to support For InDel detection, 
we use Pindel(v0.2.4) [55] to find breakpoints of large dele- 
tions and medium-sized insertions firom paired-end or sin- 
gle reads. The short reads alignment is by BWA- backtrack 
(vO.5.9) [56] and long reads is by BWA-SW [57]. We use 
SAM Tools(v0.1.17) [58] to manipulate alignments in the 
SAM format. Copy number variations detection is by 
CNV-seq [59] which is based on a robust statistical model 
with 50x high-quality reads from Illumina HiSeq 2000. 

Identification of structural variations 

We used Blat (v 34) [60] with default parameter to align 
scaffolds of P. chrysogenum NCPC10086 to the reference, 
whole genome of P. chrysogenum Wisconsin54-1255, to 
search colinearity between them. The alignment results of 
each scaffold indicate a candidate location of the scaffold. 
For scaffolds with multiple hits, the top ten hits with high- 
est sequence similarity remained as candidate locations. 
The alignment with the longest matches in a linear orien- 
tation between a scaffold and the reference was picked as 
'best-hit' of the scaffold. After finding structural variations, 
we use Blastn with parameter 'le'^' to check the detail 
alignment results. We randomly pick up 20 x mate-pair 
reads fi-om ISOx high-quality Illumina HiSeq 2000 reads. 
Reads mapping is by SOAP (v 2.21) [53] and the alignment 
result is visualized by MapView (v 3.4.1) [61]. The 5'-3' 
primers of PCR identification of structural variations of 
"266 Kb translocation" are ACCTGGCGTGCCTCATG- 
CAGCG and TTGGGGTGGAATGACGTGGGG, which 
are before 200 bp and after 300 bp of the breakpoint. The 
5'-3' primers of PCR identification of structural variations 
of "1,202 Kb translocation" are ACCTGTGGGGATCAT- 
TAGCCTCC and ACTCGGATAGTCTAGGTTCGGCGG, 
which are before 250 bp and after 220 bp of the breal<point. 

Availability of supporting data 

P. chrysogenum strain NCPC10086 genome sequences 
are available via GenBank/EMBL/DDBJ under the acces- 
sion APKGOOOOOOOO. 
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Additional file 1: The list of polymorphic genes with non- 
synonymous mutations. 
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